Adjust `ForkJoinPool` prefix in `HdfsClientThreadLeakFilter` #127534

JeremyDahlgren · 2025-04-29T21:34:30Z

Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in
HdfsClientThreadLeakFilter to pick up .commonPool-worker-N threads as well. This resolves the
"There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests:

    com.carrotsearch.randomizedtesting.ThreadLeakError: 4 threads leaked from SUITE scope at org.elasticsearch.repositories.hdfs.HdfsTests: 
       1) Thread[id=53, name=ForkJoinPool.commonPool-worker-1, state=WAITING, group=TGRP-HdfsTests]
            at java.base/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base/java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:2143)
            at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:2034)
            at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:187)

Closes #127290
Closes #127289
Closes #127288
Closes #127287

Changes "ForkJoinPool-" to "ForkJoinPool." in the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes elastic#127290 Closes elastic#127289 Closes elastic#127288 Closes elastic#127287

elasticsearchmachine · 2025-04-29T21:34:54Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd · 2025-04-30T00:26:48Z

Out of curiosity, do you know why the test started failing recently? The hadoop version is the same between 9.0 and 9.1. So I wonder what else has change to cause the failures?

nicktindall · 2025-04-30T00:44:33Z

Out of curiosity, do you know why the test started failing recently? The hadoop version is the same between 9.0 and 9.1. So I wonder what else has change to cause the failures?

They all seem to be openjdk21 builds, I wonder if the JDK changed the name of the ForkJoinPool threads?

ywangd · 2025-04-30T01:49:30Z

Does not look like JDK version related to me since the tests do not fail with JDK21 and 8.18.

ldematte · 2025-04-30T07:36:26Z

I've looked back at the JDK source and the ForkJoinPool.commonPool-worker prefix has been like this since at least JDK 17, so I also wonder why tests started failing now -- maybe we tightened the checks? Or maybe we (hadoop) were not leaking any ForkJoinPool thread before?

ldematte

I would consider using ForkJoinPool.commonPool-worker- as a prefix, to be consistent with the JDK/ with our own code (see TestContainersThreadFilter), but in any case this looks OK to me.

nicktindall · 2025-05-01T01:19:34Z

It looks like for some reason we've moved from using a specific ForkJoinPool instance to using the "common fork join pool"

See that when you create a ForkJoinPool with a specific config, the workerNamePrefix gets set to ForkJoinPool-{pid}-worker-..., which matches the existing pattern.

But the common fork join pool does not have a workerNamePrefix, which causes it to default to ForkJoinPool.commonPool-worker-.... This deviates from the existing pattern.

I tried to reproduce it locally but I can't seem to get leaks from any fork join pool, so it's hard to pin down where hadoop is using it and why that behaviour changed.

Hadoop seems to use both kinds of ForkJoinPool (common and specific) fairly extensively

nicktindall

It would be nice to have an explanation for why this started failing recently, but being the common fork join pool those threads really could be coming from anywhere.

LGTM unless anyone thinks it's worth holding off until we have an explanation?

JeremyDahlgren · 2025-05-01T22:26:25Z

It would be nice to have an explanation for why this started failing recently, but being the common fork join pool those threads really could be coming from anywhere.

I trimmed down HdfsTests.testSimpleWorkflow() to just creating the repository:

    public void testSimpleWorkflow() {
        Client client = client();
        assertAcked(
            client.admin()
                .cluster()
                .preparePutRepository(TEST_REQUEST_TIMEOUT, TEST_REQUEST_TIMEOUT, "test-repo")
                .setType("hdfs")
                .setSettings(
                    Settings.builder()
                        .put("uri", "hdfs:///")
                        .put("conf.fs.AbstractFileSystem.hdfs.impl", TestingFs.class.getName())
                        .put("path", "foo")
                        .put("chunk_size", randomIntBetween(100, 1000) + "k")
                        .put("compress", randomBoolean())
                )
        );

This fails with the commonPool threads leaked for JDK 22 and 23, but not for JDK 21 and 24 on my machine:

for JDK in 21 22 23 24 ; do ./gradlew ":plugins:repository-hdfs:test" --tests "org.elasticsearch.repositories.hdfs.HdfsTests.testSimpleWorkflow" -Druntime.java=$JDK ; done

Per the ForkJoinPool.commonPool() docs:

This pool is statically constructed; its run state is unaffected by attempts to `shutdown()` or `shutdownNow()`. However this pool and any ongoing processing are automatically terminated upon program `System.exit(int)`.

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

nicktindall · 2025-05-02T06:42:23Z

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

I think it's a safe thing to suppress. The use of the common pool could be coming from the JVM libraries. You could stick a breakpoint in the lazy initialisation perhaps? that might reveal where it's coming from. But if it's inconclusive I don't know if it's worth spending a lot of time on?

JeremyDahlgren · 2025-05-02T21:33:05Z

So it seems reasonable to filter the commonPool threads. But yes, the question remains what the difference is running between these JDKs.

I think it's a safe thing to suppress. The use of the common pool could be coming from the JVM libraries. You could stick a breakpoint in the lazy initialisation perhaps? that might reveal where it's coming from. But if it's inconclusive I don't know if it's worth spending a lot of time on?

The ForkJoinPool.commonPool threads are created from down in Hadoop statistics that use parallelStream(), roughly from running in the debugger with a JDK 21:

HdfsBlobContainer.writeToPath():265-267:

has a try/finally that closes the FSDataOutputStream,
that closes the BufferedIOStatisticsOutputStream,
that closes the RawLocalFileSystem$LocalFSFileOutputStream,
that has a finally block that cals IOStatisticsAggregator.aggregate()
which calls IOStatisticsSnapshot.aggregate(), which does several calls to
IOStatisticsBinding.aggregateMaps() that calls
EvaluatingStatisticsMap.entrySet() has evalEntries.parallelStream().map(...).collect(...)
—————— JDK —————
ReferencePipeline.collect()
AbstractPipeline.evaluate()
ReduceOps.evaluateParallel()
ForkJoinTask.invoke()
ForkJoinTask.doExec()
CountedCompleter.exec()
AbstractTAsk.compute()
ForkJoinTask.fork()
DefaultForkJoinWorkerThreadFactory.newCommonWithACC(pool)
InnocuousForkJoinWorkerThread
ForkJoinWorkerThread()
createWorker()

Note that ForkJoinPool.commonPool-worker- is also filtered in TestContainersThreadFilter.

Also it looks like there is a change in DefaultForkJoinWorkerThreadFactory.newThread() between 21 and 24 relating to the security manager. The timing of the test failures lines up with other test failures and updates going on in core infra with security manager and entitlements changes.

…#127534) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes elastic#127290 Closes elastic#127289 Closes elastic#127288 Closes elastic#127287 (cherry picked from commit 4408e38)

JeremyDahlgren · 2025-05-04T11:41:28Z

💚 All backports created successfully

Status	Branch	Result
✅	8.19

Questions ?

Please refer to the Backport tool documentation

…#127534) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes elastic#127290 Closes elastic#127289 Closes elastic#127288 Closes elastic#127287 (cherry picked from commit 4408e38)

JeremyDahlgren · 2025-05-04T11:43:20Z

💚 All backports created successfully

Status	Branch	Result
✅	9.0

Questions ?

Please refer to the Backport tool documentation

…127683) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes #127676 Closes #127677 Closes #127678 Closes #127679 (cherry picked from commit 4408e38)

…127684) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. (cherry picked from commit 4408e38)

…#127534) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes elastic#127290 Closes elastic#127289 Closes elastic#127288 Closes elastic#127287

…#127534) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes elastic#127290 Closes elastic#127289 Closes elastic#127288 Closes elastic#127287 (cherry picked from commit 4408e38)

JeremyDahlgren · 2025-05-22T13:29:39Z

💚 All backports created successfully

Status	Branch	Result
✅	8.18

Questions ?

Please refer to the Backport tool documentation

#128309) Adds the ForkJoinPool.commonPool-worker- prefix to the Thread getName().startsWith() checks in HdfsClientThreadLeakFilter. This resolves the "There are still zombie threads that couldn't be terminated" errors in the Hdfs IT tests. Closes #128305 Closes #128306 Closes #128307 Closes #128308 (cherry picked from commit 4408e38)

JeremyDahlgren added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0 labels Apr 29, 2025

JeremyDahlgren requested review from breskeby, ldematte, nicktindall and nielsbauman April 29, 2025 21:37

Filter by ForkJoinPool

d8ed34b

nielsbauman removed their request for review April 30, 2025 06:10

ldematte reviewed Apr 30, 2025

View reviewed changes

Use ForkJoinPool.commonPool-worker- for new prefix addition

164c33c

nicktindall approved these changes May 1, 2025

View reviewed changes

JeremyDahlgren merged commit 4408e38 into elastic:main May 2, 2025
17 checks passed

JeremyDahlgren added v8.19.0 auto-backport Automatically create backport pull requests when merged and removed auto-backport Automatically create backport pull requests when merged labels May 4, 2025

JeremyDahlgren mentioned this pull request May 4, 2025

[8.19] Adjust ForkJoinPool prefix in HdfsClientThreadLeakFilter (#127534) #127683

Merged

JeremyDahlgren mentioned this pull request May 4, 2025

[9.0] Adjust ForkJoinPool prefix in HdfsClientThreadLeakFilter (#127534) #127684

Merged

JeremyDahlgren added the v9.0.2 label May 4, 2025

JeremyDahlgren mentioned this pull request May 22, 2025

[8.18] Adjust ForkJoinPool prefix in HdfsClientThreadLeakFilter (#127534) #128309

Merged

Adjust ForkJoinPool prefix in HdfsClientThreadLeakFilter #127534

Adjust ForkJoinPool prefix in HdfsClientThreadLeakFilter #127534

Uh oh!

Conversation

JeremyDahlgren commented Apr 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticsearchmachine commented Apr 29, 2025

Uh oh!

ywangd commented Apr 30, 2025

Uh oh!

nicktindall commented Apr 30, 2025

Uh oh!

ywangd commented Apr 30, 2025

Uh oh!

ldematte commented Apr 30, 2025

Uh oh!

ldematte left a comment

Choose a reason for hiding this comment

Uh oh!

nicktindall commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicktindall left a comment

Choose a reason for hiding this comment

Uh oh!

JeremyDahlgren commented May 1, 2025

Uh oh!

nicktindall commented May 2, 2025

Uh oh!

JeremyDahlgren commented May 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

JeremyDahlgren commented May 4, 2025

💚 All backports created successfully

Questions ?

Uh oh!

JeremyDahlgren commented May 4, 2025

💚 All backports created successfully

Questions ?

Uh oh!

JeremyDahlgren commented May 22, 2025

💚 All backports created successfully

Questions ?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Adjust `ForkJoinPool` prefix in `HdfsClientThreadLeakFilter` #127534

Adjust `ForkJoinPool` prefix in `HdfsClientThreadLeakFilter` #127534

JeremyDahlgren commented Apr 29, 2025 •

edited

Loading

nicktindall commented May 1, 2025 •

edited

Loading

JeremyDahlgren commented May 2, 2025 •

edited

Loading